April 2021

Thanks and acknowledgements

  • I acknowledge Aboriginal nations as the first people of Australia, and the Gadigal and Bidjigal peoples who are the Traditional Custodians of the land on which I live and work, and pay my respects to elders past, present and emerging.

  • Collaborators - David Warton, Francis Hui, Dianne Cook

  • Ecostats research group, the School of Maths and Stats, and Stats Central at UNSW.

  • Funding from the Australian Research Council.

  • Statistical society of Australia NSW branch for the invite.

Motivation

Residuals to asses model fit

lm_fit<-lm(Y~X1+X2, data=dat)
plot(lm_fit, which = 1)

Faraway’s Linear Models with R (2005, p. 59)

Quiz: Which model fits best?

Quiz: Which model fits best?

Quiz: Which model fits best?

Quiz: Which model fits best?

Motivation

When visualizing we want

  • important patterns to stand out
  • remove / minimize unimportant artefacts

When (scatter) plotting we often expect

  • continuous data
  • Normal distribution (symmetric, not skewed)
  • \(N(0,1)\) in some cases (e.g. residual plots)

GLM residuals

Pearson residuals - if model correct

  • zero mean, unit variance (\(\mu = 0\), \(\sigma=1\) )
  • not generally normally distributed
  • discrete

Other (less good) residuals

  • raw, deviance, working

Dunn - Smyth Residuals - if model correct

  • normally distributed \(\sim N(0,1)\)
  • same as for linear regression

Residual types matter

glm_fit<-glm(Y~X1 + X2, data=dat, family="poisson")

Pearson residuals

plot(residuals(glm_fit)~
       fitted(glm_fit))

Dunn - Smyth Residuals

library(DHARMa)
res_glm <- simulateResiduals(
  fittedModel = glm_fit)
plot(residuals(res_glm, qnorm)~ 
       fitted(res_glm))

Multivariate data

Multivariate data

Interesting things we want to see in multivariate data

  • relationships between variables, e.g. correlation, interactions
  • relationships between observations - clustering, covariate relationships, outliers

Discrete data properties like mean / variance relationship and discreteness can obscure these.

Penguins

Artwork by @allison_horst

library(palmerpenguins)
  • Palmer Archipelago (Antarctica) Penguin Data
  • Size measurements, clutch observations, and blood isotope ratios
  • Data were collected and made available by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program.

Penguin data wrangling

library(palmerpenguins)
penguins = penguins_raw %>% 
  transmute(bill_length = `Culmen Length (mm)`,
         bill_depth = `Culmen Depth (mm)`,
         flipper_length = `Flipper Length (mm)`,
         mass= `Body Mass (g)`,
         isoN_15v14= `Delta 15 N (o/oo)`,
         isoC_13v12=`Delta 13 C (o/oo)`,
         sex=Sex,
         Species = factor(Species,
                          labels = c("Adelie", "Gentoo", "Chinstrap")),
         Island) %>% 
  na.omit()

head(penguins)
## # A tibble: 6 x 9
##   bill_length bill_depth flipper_length  mass isoN_15v14 isoC_13v12 sex  
##         <dbl>      <dbl>          <dbl> <dbl>      <dbl>      <dbl> <chr>
## 1        39.5       17.4            186  3800       8.95      -24.7 FEMA~
## 2        40.3       18              195  3250       8.37      -25.3 FEMA~
## 3        36.7       19.3            193  3450       8.77      -25.3 FEMA~
## 4        39.3       20.6            190  3650       8.66      -25.3 MALE 
## 5        38.9       17.8            181  3625       9.19      -25.2 FEMA~
## 6        39.2       19.6            195  4675       9.46      -24.9 MALE 
## # ... with 2 more variables: Species <fct>, Island <chr>

Visualising multivariate data

Visualising multivariate data

Visualising multivariate data

Visualising multivariate data

Pairwise plots

library(GGally)
ggpairs(penguins,columns = 1:6, ggplot2::aes(colour=Species))

Factor analysis

library(ggfortify)
penguins[,1:6] %>% 
  factanal(factors = 2, scores = 'regression') %>% 
  autoplot(data = penguins, colour='Species', loadings = TRUE, loadings.label = TRUE)+theme_classic()

Tours

library(tourr)
animate_xy(penguins[,1:6], col=penguins$Species)

Discrete data

Discrete data - Hunting spiders

Discrete data - Hunting spiders

Photo credit: https://araneae.nmbe.ch

library(ecoCopula)
data(spider)
  • Abundance of hunting spiders and associated environmental variables
  • Data attributed to van der Aart & Smeenk-Enserink (1975), obtained from the spider2 directory, CANOCO FORTRAN package exported from mvabund R package.

Spider wrangle

spiders=data.frame(spider$abund,spider$x) %>% 
  mutate(leaves=ifelse(fallen.leaves>0,"leaves", "no leaves")) %>% 
  na.omit()

head(spiders)
##   Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi Pardlugu Pardmont
## 1       25       10        0        0        0        4        0       60
## 2        0        2        0        0        0       30        1        1
## 3       15       20        2        2        0        9        1       29
## 4        2        6        0        1        0       24        1        7
## 5        1       20        0        2        0        9        1        2
## 6        0        6        0        6        0        6        0       11
##   Pardnigr Pardpull Trocterr Zoraspin soil.dry bare.sand fallen.leaves   moss
## 1       12       45       57        4   2.3321    0.0000        0.0000 3.0445
## 2       15       37       65        9   3.0493    0.0000        1.7918 1.0986
## 3       18       45       66        1   2.5572    0.0000        0.0000 2.3979
## 4       29       94       86       25   2.6741    0.0000        0.0000 2.3979
## 5      135       76       91       17   3.0155    0.0000        0.0000 0.0000
## 6       27       24       63       34   3.3810    2.3979        3.4340 2.3979
##   herb.layer reflection    leaves
## 1     4.4543     3.9120 no leaves
## 2     4.5643     1.6094    leaves
## 3     4.6052     3.6889 no leaves
## 4     4.6151     2.9957 no leaves
## 5     4.6151     2.3026 no leaves
## 6     3.4340     0.6931    leaves

Hunting spiders data

Discrete data properties (we don’t want to visualise)

  • skewness (many small counts, fewer large counts),
  • variance depends on mean, small counts more similar, large counts more variable
  • discrete - some observations identical and get plotted on top of each other (or in weird lines)

Note on spider data

  • A bit different to penguin data
  • For penguin traits are the variables (e.g. flipper length), for spiders species are the variables (e.g Pardmont, short for Pardosa monticola)
  • For penguins the colours are species, for spiders the colours are environmental variables (e.g fallen leaves in plot)

Discrete pairwise plot

ggpairs(spiders,columns = 7:12, ggplot2::aes(colour=leaves))

Discrete tourr

spiders[7:12] %>% 
  animate_xy(col=spiders$leaves)

Existing fixes - transformation and jittering

Transformations

  • aim to stabilise variance (similar to pearson residuals)
  • can remove skewness \(\rightarrow\) more symmetric
  • works okay on large counts, less good on small counts, ordinal, binary…
  • can’t check transformation is ‘correct’

Jittering

  • adds noise to data, usually \(N(0,\sigma)\) or \(U(-\sigma,\sigma)\)
  • removes discreteness
  • what \(\sigma\) to use? - trial and error
  • does not always preserve ordering: \(y_1<y_2 \;\not\!\!\!\implies J(y_1)<J(y_2)\)

Outstanding question

  • which transformation / how much jittering

Modelling

Modelling

Dunn - Smyth Residuals

Modelling / Dunn-Smyth Residuals

Can think of as more principled transformation / jittering in one

Like transformations

  • stabilise variance
  • remove skewness \(\rightarrow\) more symmetric

Like jittering

  • remove discreteness

But also

  • we can check distribution is appropriate
  • works well all data types, including binary, ordinal etc.
  • preserves ordering:

\(y_1<y_2 \implies DS(y_1)<DS(y_2)\)

Dunn - Smyth Residuals

  • a.k.a. Randomised quantile residuals (Dunn & Smyth, 1996)
  • on Normal scale
  • implemented in several packages (e.g. mvabund, glarma, gamlss)
  • DHARMa package produces Dunn - Smyth residuals for a verity of models including - lm, glm, lme4, glmmTMB, mgcv, JAGS, STAN and BUGS (by default on uniform scale)

Dunn, K. P., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 1-10.

Related literature

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Dunn - Smyth Residuals

Spider residuals

library(ecoCopula)
spider_ds = stackedsdm(spiders[1:12], data=spiders,
           family="negative.binomial") %>% 
  residuals() %>% 
  data.frame() %>% 
  cbind(spiders[,-c(1:12)])

There should be a package soonish to make this easier.

Spider residuals

##   Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi Pardlugu Pardmont
## 1       25       10        0        0        0        4        0       60
## 2        0        2        0        0        0       30        1        1
## 3       15       20        2        2        0        9        1       29
## 4        2        6        0        1        0       24        1        7
## 5        1       20        0        2        0        9        1        2
## 6        0        6        0        6        0        6        0       11
##   Alopacce Alopcune Alopfabr Arctlute Arctperi Auloalbi Pardlugu Pardmont
## 1    1.564    0.911   -1.675   -1.512    0.713    0.689   -0.288    1.490
## 2   -0.519    0.137   -0.942    0.153   -1.696    1.798    0.121   -0.502
## 3    1.116    1.475    0.627    1.017   -0.636    1.061    0.101    0.936
## 4    0.068    0.646    0.152    0.695   -2.333    1.636   -0.068    0.159
## 5   -0.074    1.488   -1.932    1.025   -0.574    1.026    0.081   -0.312
## 6   -0.973    0.616   -0.127    1.692   -0.980    0.849   -0.787    0.385

Pairwise spiders - Dunn-Smyth

ggpairs(spider_ds, columns = 7:12, ggplot2::aes(colour=leaves))

Spider tourr - Dunn-Smyth

spider_ds[,7:12] %>% 
  animate_xy(col=spiders$leaves)

Other multivariate visualisations

Copulas (intuition)

There are several ways to extend GLMs to model multivariate data

  • hierarchical models (lme4, glmmTMB, mcmcglmm etc.)
  • Generalized estimating equations (geepak)
  • Copulas (gcmr, ecoCopula)

Gaussian copulas are intuitively similar to Dunn-Smyth residuals. Fitting a Gaussian copula goes something* like this

  • Calculate Dunn-Smyth residuals for each variable (e.g. species of spider) separately as before, using glm as a model.
  • Each of these sets of residuals are \(\sim N(0,1)\)
  • Assume they are jointly Normal across variables (species of spider)
  • Carry out factor analysis, graphical modelling, etc. on the residuals

*technically incorrect but intuitively illuminating

Copula as latent variable model

  • Let each variable \(j\) (e.g. species of spider) have a marginal distribution \(F_j(\cdot)\),
  • e.g. \(F_{Pardmont}\) is the Poisson distribution with mean \(\lambda =3.5\).
  • Let \(\mathbf{z}\) be a multivariate Normal latent variable, \(\mathbf{z} \sim N_D(0, \Sigma)\)
  • The latent variable \(\mathbf{z}\) gives rise to discrete data \(\mathbf{y}\) via \[y_{j} = F_j^{-1}(\Phi(z_{j}))\]
  • We estimate this discrete Gaussian copula by importance sampling using Dunn-Smyth residuals.
  • The correlation matrix \(\Sigma\) induces correlation in \(\mathbf{y}\).
  • We can use \(\Sigma\) as a basis of covariance modelling (factor analysis, graphical modelling, etc.)

Copula ordination (ecoCopula)

cols=ifelse(spiders$leaves=="leaves","#F8766D","#00BFC4")
spiders[,1:12] %>% 
  stackedsdm(family="negative.binomial", data=spiders) %>% 
  cord() %>% 
  plot(biplot = TRUE, site.col=cols)

Ordinal multivariate data & fancier plots

Some ordinal data

Estimates of cover class (ordinal) for all non-tree vascular plant species in 160 375m^2 circular sample plots in Bryce Canyon National Park, Utah, U.S.A., as well as environmental variables like elevation.

Import data and wrangle.

library(labdsv)
# site data
data(brycesite)
brycesite$plotcode=substr(brycesite$plotcode,3,5)
#species data
data(bryceveg)
bryceveg<- bryceveg[,-which(colSums(bryceveg>0) <= 20)]  #most abundant species
#recode data to integer categories
old <- c(0,0.2,0.5,1,2,3,4,5,6) #existing categories
bryceord=bryceveg
for(i in 1:length(old)){
  bryceord[bryceveg==old[i]]=i
}

bryce_ds = bryceord[,1:6] %>% 
  stackedsdm(data=brycesite, family="ordinal") %>% 
  residuals() %>% 
  data.frame() 

Some ordinal data

head(bryceord)
##         ameuta arcpat ceamar cermon chrvis juncom pacmyr purtri quegam ribcer
## bcnp__1      1      4      3      1      1      1      1      1      1      1
## bcnp__2      3      3      1      1      1      5      3      1      1      1
## bcnp__3      1      4      3      1      1      4      3      1      1      3
## bcnp__4      3      4      3      1      1      1      1      1      1      1
## bcnp__5      1      7      3      1      1      4      3      3      1      1
## bcnp__6      3      4      4      1      1      3      3      3      1      1
##         symore artarb berrep tetcan carrss koenit oryhym poafen sithys sticom
## bcnp__1      4      1      4      1      3      1      1      3      1      1
## bcnp__2      3      1      1      1      3      1      1      1      1      1
## bcnp__3      3      1      3      1      3      1      1      1      1      1
## bcnp__4      3      1      4      1      3      1      1      1      1      1
## bcnp__5      3      1      3      1      4      1      1      1      1      1
## bcnp__6      3      1      4      1      3      1      1      1      1      1
##         achmil astmis calnut caslin drasub eriala erirac euplur hymric leppun
## bcnp__1      1      3      1      1      1      1      1      1      1      1
## bcnp__2      3      1      1      1      1      1      1      3      1      1
## bcnp__3      3      1      1      1      1      1      1      1      1      1
## bcnp__4      1      1      1      1      1      1      1      1      1      1
## bcnp__5      1      3      1      3      1      1      1      3      1      1
## bcnp__6      1      1      1      1      1      1      1      1      1      1
##         linlew lotuta macgri oenlav pedcan senmul swerad
## bcnp__1      1      1      1      1      1      1      1
## bcnp__2      1      1      1      1      1      3      3
## bcnp__3      1      1      1      1      1      3      1
## bcnp__4      1      1      1      1      1      3      1
## bcnp__5      1      1      1      1      3      3      3
## bcnp__6      1      1      1      1      1      1      3

Ordinal pairs plot (raw)

  ggpairs(bryceord[,1:6], ggplot2::aes(colour=brycesite$elev), 
        upper=list(continuous = "blank"),
        axisLabels="none")+
    scale_color_gradientn(colours = brewer.pal(n = 10, name = "PuOr"))+
  theme_bw()

Ordinal pairs plot (DS residuals)

  ggpairs(bryce_ds, ggplot2::aes(colour=brycesite$elev), 
        upper=list(continuous = "blank"),
        axisLabels="none")+
  scale_color_gradientn(colours = brewer.pal(n = 10, name = "PuOr"))+
  theme_bw()

Ordinal ordination

Ordination

bryce_ordi = bryceord %>%
  stackedsdm(data = brycesite, family="ordinal") %>%
  cord()

Plotting

site_res <- data.frame(bryce_ordi$scores,brycesite)
sp_res <- data.frame(bryce_ordi$loadings,
                     species = colnames(bryceord))
alpha= 2.5 # scaling parameter
ggplot()+
  geom_point(aes(x=Factor1,y=Factor2,color = elev ),site_res)+ #sites
  geom_text(aes(x = Factor1*alpha, y = Factor2*alpha,label = species),data=sp_res)+ #species
  scale_color_gradientn(colours = brewer.pal(n = 10, name = "PuOr"))+
  theme_classic()

Ordinal ordination

References

Dai, F., Zhu, Y., & Maitra, R. (2019). Three-dimensional Radial Visualization of High-dimensional Continuous or Discrete Data. arXiv.

Hirschfeld, H.O. (1935) “A connection between correlation and contingency”, Proc. Cambridge Philosophical Society, 31, 520–524

Shepard, R. N. (1962). The analysis of proximities: multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2), 125-140.

Warton, D. I., Wright, S. T., & Wang, Y. (2012). Distance‐based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3(1), 89-101.

Liu, H., Lafferty, J., & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(10).

Thanks and acknowledgements

  • I acknowledge Aboriginal nations as the first people of Australia, and the Gadigal and Bidjigal peoples who are the Traditional Custodians of the land on which I live and work, and pay my respects to elders past, present and emerging.

  • Collaborators - David Warton, Francis Hui, Dianne Cook

  • Ecostats research group, the School of Maths and Stats, and Stats Central at UNSW.

  • Funding from the Australian Research Council.

  • Statistical society of Australia NSW branch for the invite.